2022-10-22

Summary

  • Percent body fat, although an accurate health indicator, is difficult and costly to measure
  • Aim: create a model to accurately estimate a male’s percentage body fat from easily made body measurements
  • Final model characteristics:
    • Backwards stepwise selection
    • BIC
    • No intercept

Percent Body Fat =

0.05795 * Age - 11.27374 * Height + 0.76046 * Abdomen - 1.84242 * Wrist
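As a rough illustration, the formula above can be wrapped in a small R function. The function and argument names are hypothetical, and the units (height in metres, abdomen and wrist in centimetres) are an assumption based on the conversion to metric units described under Data Tidying. The example inputs are made up and do not come from the dataset.

```r
# Hypothetical helper that evaluates the final no-intercept model.
# Assumed units: age in years, height in metres, abdomen and wrist in cm.
predict_pbf <- function(age, height_m, abdomen_cm, wrist_cm) {
  0.05795 * age - 11.27374 * height_m + 0.76046 * abdomen_cm - 1.84242 * wrist_cm
}

# Made-up example inputs, not taken from the dataset
example_pbf <- predict_pbf(age = 40, height_m = 1.80, abdomen_cm = 90, wrist_cm = 18)
print(round(example_pbf, 1))  # 17.3
```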

Introduction (problem overview)

  • Obesity has become an increasingly important global health problem
  • Percent body fat (PBF) is an accurate indicator of weight category (e.g., obese)
  • PBF is difficult, time-consuming and expensive to measure

Aim: create a model that can predict a male’s PBF from more easily accessible measurements

Dataset description

  • Data set includes 15 body measurements of 250 males
  • All variables are continuous apart from age, which is discrete
  • Measurements are potentially useful to estimate percent body fat

Source: BYU Human Performance Research Center http://www.byu.edu/chhp/intro.html

Data Tidying

  1. Converted variables to metric units
  2. Removed the density variable, since body fat percentage is calculated directly from it
  3. Removed two observations with <1% body fat
  4. BMI for each male was calculated and added

Cleaned Data and Model Selection Foundation

Number of Observations: 248

Model response variable: percent body fat

Model potential predictor variables: 14 remaining variables

Model Selection (Summary)

  • Forward vs Backward
  • Intercept
  • BMI
  • AIC vs BIC

Backwards vs Forwards AIC

                     Backward Model        Forward Model
Predictors           Estimate   p          Estimate   p
(Intercept)             8.27    0.317       -26.31    0.001
Age                     0.06    0.020         0.04    0.150
Height                -13.72    0.004
Neck                   -0.33    0.135
Chest                  -0.13    0.143
Abdomen                 0.86   <0.001
Forearm                 0.34    0.079
Wrist                  -1.70    0.001        -1.78   <0.001
Waist                                         2.31   <0.001
Weight                                       -0.21    0.005
Bicep                                         0.28    0.068
Observations             248                   248
R2 / R2 adjusted     0.741 / 0.733         0.734 / 0.728
AIC                  1426.933              1429.366
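A comparison like this one can be produced with the base R function step(). The sketch below runs backward and forward selection on simulated data, since the dataset itself is not included here; the predictors x1 to x4 and the object names are hypothetical, and this is not necessarily the exact call used for the table above.

```r
set.seed(1)
n <- 248
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - 1.0 * sim$x2 + rnorm(n)  # only x1 and x2 matter

full <- lm(y ~ x1 + x2 + x3 + x4, data = sim)  # start point for backward
null <- lm(y ~ 1, data = sim)                  # start point for forward

# Backward elimination with the AIC penalty (k = 2 is the default)
back_aic <- step(full, direction = "backward", trace = 0)

# Forward selection with the AIC penalty
fwd_aic <- step(null,
                scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4),
                direction = "forward", trace = 0)

# Backward elimination with the BIC penalty (k = log(n))
back_bic <- step(full, direction = "backward", k = log(n), trace = 0)

print(names(coef(back_aic)))
print(names(coef(fwd_aic)))
print(names(coef(back_bic)))
```

Forward and backward searches can end at different models, as they do in the table above, because each only explores one path through the candidate models.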

Testing the Intercept

BMI with intercept

BMI without intercept

  • with BMI vs without BMI

Include BMI

AIC

BIC

Performance and results

10-fold cross-validation comparing the AIC and BIC models

Method RMSE MAE
AIC 4.246381 3.519275
BIC 4.237937 3.522454
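For reference, 10-fold cross-validated RMSE and MAE can be computed in base R roughly as follows. This is a sketch on simulated data with a hypothetical helper cv_errors(); the References list caret, which may have been used for the figures above, so the numbers here are not expected to match.

```r
set.seed(42)
n <- 248
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1.5 * sim$x1 - sim$x2 + rnorm(n, sd = 4)

# Hypothetical helper: k-fold CV for a linear model, averaging per-fold errors
cv_errors <- function(formula, data, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  response <- all.vars(formula)[1]
  per_fold <- sapply(seq_len(k), function(i) {
    fit <- lm(formula, data = data[folds != i, ])             # train on k - 1 folds
    err <- data[[response]][folds == i] -
      predict(fit, newdata = data[folds == i, ])              # test on held-out fold
    c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)))
  })
  rowMeans(per_fold)
}

cv_result <- cv_errors(y ~ x1 + x2 - 1, sim)
print(cv_result)
```

RMSE penalises large errors more heavily than MAE, which is why the two metrics can rank the AIC and BIC models differently, as in the table above.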

Result

Our final model is the backward-selection BIC model with no intercept.

##          Age       Height      Abdomen        Wrist 
##   0.05794883 -11.27373938   0.76045631  -1.84242340
## 
## Call:
## lm(formula = Pct.BF ~ Age + Height + Abdomen + Wrist - 1, data = df3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3745  -3.0720  -0.4906   3.3440   9.5735 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## Age       0.05795    0.02283   2.538 0.011776 *  
## Height  -11.27374    3.23823  -3.481 0.000591 ***
## Abdomen   0.76046    0.03341  22.762  < 2e-16 ***
## Wrist    -1.84242    0.39360  -4.681 4.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.241 on 244 degrees of freedom
## Multiple R-squared:  0.9593, Adjusted R-squared:  0.9586 
## F-statistic:  1436 on 4 and 244 DF,  p-value: < 2.2e-16

Assumptions

Initial look

Linearity

  • Predictors and outcome assumed to have a linear relationship
  • Residuals should show no distinct pattern around the horizontal line; however, some pattern is visible

Looking further into Linearity

  • Plotting each predictor against body fat shows points spread fairly evenly above and below the fitted line

Independence

  • We assume the data were collected independently, i.e. that the errors are independent
  • Data were collected by Brigham Young University, a fairly trusted source
  • The source states that 250 different men of varying ages were measured, which suggests the observations are independent of each other

Normality

  • Residual error assumed to be normally distributed
  • Points on QQ plot are fairly close to the line up until the ends, which may be of concern
  • However, with a large number of observations (248), the Central Limit Theorem suggests that inference on the coefficients remains approximately valid even if the residuals are not exactly normal

Homoscedasticity

  • Residuals assumed to have constant variance
  • Points seem equally spread above and below the line
  • The line is not perfectly horizontal, but the departure does not seem large enough to be of concern
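The plots discussed in the assumption checks above (residuals vs fitted, normal QQ, scale-location) can be drawn for any lm fit with base R's plot(). The sketch below uses simulated data and hypothetical variable names; the References list ggfortify, which may have been used for the original figures, so this base R version is a stand-in.

```r
set.seed(3)
n <- 248
sim <- data.frame(x1 = rnorm(n, mean = 90, sd = 10),
                  x2 = rnorm(n, mean = 18, sd = 1))
sim$y <- 0.7 * sim$x1 - 1.8 * sim$x2 + rnorm(n, sd = 4)

fit <- lm(y ~ x1 + x2 - 1, data = sim)  # no-intercept fit, as in the final model

op <- par(mfrow = c(1, 3))  # three panels side by side
plot(fit, which = 1)  # residuals vs fitted: linearity
plot(fit, which = 2)  # normal QQ plot: normality of residuals
plot(fit, which = 3)  # scale-location: constant variance
par(op)
```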

Discussion and conclusion

Effectiveness

Percent Body Fat = 0.05795 * Age - 11.27374 * Height + 0.76046 * Abdomen - 1.84242 * Wrist

  • Number of variables: 4

  • P-value (overall F-test): < 2.2e-16

  • Adjusted R-squared value: 0.9586
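One caveat when reading the adjusted R-squared of 0.9586: for a model fitted without an intercept, R's summary() measures R-squared against zero rather than against the mean of the response, so the value is not directly comparable with the roughly 0.73 reported for the intercept models earlier. The sketch below shows the difference on simulated data; the variables and values are invented for illustration.

```r
set.seed(7)
n <- 248
x <- rnorm(n, mean = 90, sd = 10)
y <- 0.2 * x + rnorm(n, sd = 4)

fit0 <- lm(y ~ x - 1)  # no-intercept model
rss  <- sum(residuals(fit0)^2)

r2_reported  <- summary(fit0)$r.squared
r2_uncentred <- 1 - rss / sum(y^2)              # baseline: predict zero
r2_centred   <- 1 - rss / sum((y - mean(y))^2)  # baseline: predict the mean

print(c(reported = r2_reported, uncentred = r2_uncentred, centred = r2_centred))
```

The reported value matches the uncentred version, which is always at least as large as the centred one. The cross-validated RMSE and MAE are therefore a safer basis for comparing models with and without an intercept.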

Limitations and future study

The consideration of physical factors alone is limiting.

  • Older people

  • Shorter people

  • People who work out regularly

Future study to overcome limitations:

  • Diet structure

  • Daily exercise time
